Abstract

In this paper the classical “Westland” set of empiri- 
cal accelerometer helicopter data is analyzed with the 
aim of condition monitoring for diagnostic purposes. 
The goal is to determine features for failure events 
from these data, via a proprietary signal processing 
toolbox, and to weigh these according to a variety of 
classification algorithms. 

As regards signal processing, it appears that the au- 
toregressive (AR) coefficients from a simple linear 
model encapsulate a great deal of information in rel-
atively few measurements; it has also been found that
augmentation of these by harmonic and other parame- 
ters can improve classification significantly. As regards 
classification, several techniques have been explored, 
among these restricted Coulomb energy (RCE) net- 
works, learning vector quantization (LVQ), Gaussian 
mixture classifiers and decision trees. A problem with 
these approaches, and in common with many classifi- 
cation paradigms, is that augmentation of the feature 
dimension can degrade classification ability. Thus, we 
also introduce the Bayesian data reduction algorithm 
(BDRA), which imposes a Dirichlet prior on training 
data and is thus able to quantify probability of error in 
an exact manner, such that features may be discarded 
or coarsened appropriately. 

1 Introduction 

Qualtech Systems has developed a suite of fault- 
isolation tools (TEAMS) which can, in real time and 
based on binary sensor data, isolate single and even 
multiple faults in complex systems. However, many 
sensors (for example, of vibration) are incapable of 
reliable decision-making on their own, and hence it 
has become necessary to develop a (real-time) signal 
processing “front-end” to the TEAMS inference en- 
gine whose goal is to render decisions as intelligent as 
possible. The signal processing system includes a wide 
menu of spectral and statistical manipulation primi- 
tives such as filters, harmonic analyzers, transient de- 
tectors, and multi-resolution decomposition. 

The signal processing kit includes pattern classifica-
tion software, including techniques based on restricted
Coulomb energy (RCE), decision trees (DT), learn-
ing vector quantization (LVQ), fuzzy logic, Bayesian
data reduction (BDRA), Gaussian mixtures (GM) and
multi-layer perceptrons (MLP). At present the first
three are implemented within the SP toolkit, and the
fifth and sixth are implemented off-line in MATLAB
using features provided by the toolkit.

Recognition of faults can hence be automated pro- 
vided there is sufficient training data. This paper thus 
includes analysis of no-fault and seeded-fault vibra- 
tion data from a CH-46 (“SeaKnight”) helicopter aft 
gearbox as collected from a test-stand. This data is 
made freely available through the generosity of the 
Penn State ARL [8].

Results show promising fault detection accuracy, par-
ticularly when learning is based on autoregressive
(AR) coefficient features. The analysis presented in
this paper is an outgrowth of that in [11]. In that pa- 
per, only a very abbreviated version of the Westland 
dataset was explored, and the RCE, LVQ, and DT 
schemes were discussed. In this paper the full dataset 
is used, and the set of classifiers is augmented by the 
GM and BDRA approaches. 

In section 2 we go into detail about the toolbox clas- 
sification techniques: LVQ, DT, RCE, Gaussian mix- 
ture, and BDRA classifiers. In section 3 we apply the 
signal processing and classifiers to the Westland heli- 
copter dataset. Similar to results reported elsewhere, 
we find near-perfect fault-recognition accuracy, in our 
case with relatively small feature sets involving au- 
toregressive coefficients. 


2 The Classifiers 

2.1 Restricted Coulomb Energy Classification

The RCE classifier [4, 9] relies on the approximation 
of a decision region via a union of hypersphere “cells”.
Cells may overlap if they do not belong to the same 
class, and this may produce ambiguous outputs. Note 
that the partition of the observation space into decision
regions is not exhaustive in the RCE approach. Training
is iterative, and is described in [11]. After the network
has become fixed, classification is accomplished by in-
terrogation of membership of the various cells: each
cell is assigned a class, and the output corresponds to
that class. When a datum is a member of no cell, or of
several cells of different classes, the RCE classifier gives
an indeterminate output; such cases may be decided
randomly or by heuristic.
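
The interrogation step described above can be sketched as follows. This is a minimal illustration and not the toolbox implementation: the cell layout (center, radius, label), the Euclidean distance, and the toy numbers are all assumptions made here for the example.

```python
import math

def rce_classify(x, cells):
    """Return the class label for x, or None when the output is indeterminate.

    cells: list of (center, radius, label) tuples (hypothetical layout).
    """
    # Collect the classes of every hypersphere cell that contains x.
    hit_labels = {label for center, radius, label in cells
                  if math.dist(x, center) <= radius}
    if len(hit_labels) == 1:
        return next(iter(hit_labels))
    return None  # member of no cell, or of cells of different classes


# Toy cells, invented for illustration only.
cells = [((0.0, 0.0), 1.0, "no-fault"),
         ((3.0, 0.0), 1.5, "fault-3"),
         ((1.8, 0.0), 1.0, "fault-5")]

inside_one = rce_classify((0.1, 0.2), cells)   # single cell
ambiguous = rce_classify((2.2, 0.0), cells)    # two cells, different classes
uncovered = rce_classify((9.0, 9.0), cells)    # no cell
```

A caller would resolve the `None` cases randomly or by heuristic, as noted in the text.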

2.2 Learning Vector Quantization 
Classification 

The LVQ classifier [5] is a variation on the traditional 
cluster-classifier based on K-means training [10]. In 
essence, each class is assigned sub-clusters defined by 
their centroids, and data are classified based on the 
membership of the centroid to which they are nearest. 
Training is iterative, and is described fully in [11]. An 
LVQ classifier may be considered a development on
the earlier K-means based cluster classifiers in that
non-separable classes cause no intrinsic difficulty, and
in that there is an intelligent means of “pruning” clusters.
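
As an illustration of nearest-centroid classification with an iterative update, the sketch below uses the standard LVQ1 rule. This is an assumption: the training procedure of [11] may differ, and cluster pruning is not shown. The codebook layout, the learning rate, and the toy values are hypothetical.

```python
import math

def nearest(x, codebook):
    """Index of the codebook entry whose centroid is closest to x."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(x, codebook[i][0]))

def lvq1_step(x, label, codebook, rate=0.1):
    """One LVQ1 update: pull the winning centroid toward x if its class
    matches the training label, push it away otherwise (in place)."""
    i = nearest(x, codebook)
    centroid, c_label = codebook[i]
    sign = 1.0 if c_label == label else -1.0
    moved = [c + sign * rate * (xi - c) for c, xi in zip(centroid, x)]
    codebook[i] = (moved, c_label)

def lvq_classify(x, codebook):
    """Class of the centroid nearest to x."""
    return codebook[nearest(x, codebook)][1]


# Toy codebook with one centroid per class, invented for illustration.
codebook = [([0.0, 0.0], "no-fault"), ([1.0, 1.0], "fault")]
lvq1_step([0.2, 0.0], "no-fault", codebook)
decision = lvq_classify([0.9, 0.8], codebook)
```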

2.3 Decision Tree Classifier 

At its core, the DT classifier [10] produces its output by
asking a series of questions which must have binary
answers. The “path” taken may be thought of as tra-
versal of a logical tree; but the resultant decision
regions must take the form of hyper-rectangles. In prin-
ciple it is possible and easy to separate the training
data precisely via a sufficiently rich question set. In
practice there are too many “questions” (parameters),
and the DT classifier is found not to have a partic-
ularly good generalization ability. There are means
to limit the number of questions, and these gener- 
ally amount to the choice of a cost to be placed on 
a question's posing. In our implementation we use 
an information-theoretic cost function, although ad- 
mittedly its basis is empirical rather than true prior 
statistics. 
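
The exact cost function is not given here, so the following sketch uses the common entropy-based information gain as a stand-in. It scores a single binary question of the form "is the feature at or below a threshold?" using empirical class frequencies, which matches the remark that the basis is empirical rather than true prior statistics. Function names and toy data are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical class entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from asking 'is value <= threshold?'."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the question does not split the data
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


# Toy one-feature data, invented for illustration.
values = [0.1, 0.2, 0.8, 0.9]
labels = ["ok", "ok", "fault", "fault"]
gain_good = information_gain(values, labels, 0.5)     # separates the classes
gain_useless = information_gain(values, labels, 5.0)  # splits nothing
```

A tree builder would pose a question only when its gain outweighs the cost assigned to asking it.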


2.4 Gaussian Mixture Classifier

This classification technique has a greater statistical
grounding than the previous ones, in that a probability
density function (pdf) is sought for each class. The spe-
cific pdf used is a mixture of multivariate Gaussians:

$$p(x) = \sum_{i=1}^{M} \pi_i \, \mathcal{N}(x;\, m_i,\, R)$$

There are M elements to the mixture, and each has a
different mean $m_i$ and prior probability $\pi_i$. Decisions
are made according to the maximum posterior prob-
ability of each class (in fact, classes are assumed to
be equally likely a priori). Note that if M = 1 this is
identical to the quadratic discriminant classifier.

Training is via the expectation/maximization (EM)
algorithm [10]. The correlation matrix R is common
to all elements of the mixture within a class of fault;
this is known as a “homoscedastic” mixture, and the
ideas behind this are that the number of elements to be
estimated can be reduced and that there is little con-
cern about unboundedness of the likelihood function.
A variant of the above restricts R to be diagonal; this
reduces the number of parameters to estimate consid-
erably, but in this particular case (see, for example,
figure 1) the ability to “tilt” the pdf level curves aris-
ing from the use of a full R is valuable.


2.5 BDRA Classification

The Bayesian data reduction approach is perhaps the
most statistically defensible of the classifiers used. It
begins with a quantized version of the data, and as-
sumes a Dirichlet prior (of complete ignorance) on this,
a priori, for each fault class. From that prior distribu-
tion classification is relatively simple; the key is that
the prior enables an explicit (and correct) probability
of error to be calculated, and thence features may be
pruned in an optimal way. The BDRA is discussed in
detail in [6], among other places. Generally the BDRA
works very well when there are too many features for
the training data to support, and/or when the classes
are not easily separable.

The BDRA requires that the data be pre-quantized.
To some extent this is not a concern, since the quanti-
zation may be as fine as desired: the BDRA coarsens
the quantization as part of its feature/level selection.
For practical reasons, the quantization cannot be too
fine, and hence it is not expected that this dataset will
be kind to the BDRA. In fact, the BDRA results are
reasonable, but what is interesting is its ability to se-
lect features and its prediction of its own performance.


3 Results on CH-46 Data

3.1 The Data 

In the early 1990s the US Navy contracted with
Westland, a British helicopter manufacturer, to de- 
velop and study vibration signatures for the CH-46 
(SeaKnight) aft gearbox. Essentially this is “test- 
stand” (not in-flight) data; this is a disadvantage from 
the perspective of result reliability, but offers a distinct 
advantage in that the vibration signatures are labeled. 
The data is as follows: 

• There are 68 files each containing data traces of 
100,000 samples. 


• For each case there is data available from eight 
accelerometers. 

• There are a total of nine fault conditions, rang- 
ing in severity from mild to severe. Faults were 
“seeded” (by electronic discharge milling) in the 
sense that parts with known defects were installed 
and de-installed. 

• There is data from no-fault (normal) operating 
conditions. 

• Data was observed at nine different torque levels 
(since this is a rotorcraft, angular velocities are 
relatively constant), ranging from 27% to 100%. 

For details on the faults, etc., please see [1, 8]. Note
that if all fault levels and torques were represented
there would be 10 × 9 = 90 files (nine fault conditions
plus no-fault, at each of nine torque levels); in fact, a
number of conditions are unrepresented in the data.
As regards training
versus testing, the entire dataset is split randomly into 
two parts, which are used separately. 

The data has been analyzed previously (e.g. [1, 7, 12]) 
using a variety of classification techniques such as 
multi-layer perceptrons and fuzzy reasoning. Indeed, 
this is apparently an “easy” (or separable) dataset 
for classification, as the reported accuracies approach 
100%. Thus, our goal here is not really to beat previous 
(unbeatable!) approaches, but to attempt to match 
them using the SP toolbox classifiers. Further, it ap- 
pears that past approaches have often used a rather 
dense feature set (several hundred features, such as 
FFT outputs), and we attempt here to use a much 
sparser arsenal. 

3.2 The Features 

3.2.1 AR Coefficients 

It is possible to use periodogram outputs explicitly 
as features for classification; however, in general this 
implies a great many features, and the usual “curse of 
dimensionality” may ensue. Since it is clear that spec- 
tral features do indeed yield much relevant informa- 
tion, we propose to use a concise way of representing 
the spectrum: the autoregressive (AR) parameters [2]. 
These are estimated on blocks of various sizes, from
N = 256 to N = 16384.
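
To make the feature extraction concrete, here is a minimal sketch of estimating AR(p) coefficients on one block. The estimator is not specified in the text; the Yule-Walker equations solved by the Levinson-Durbin recursion are assumed here, along with the sign convention x[t] = a[1] x[t-1] + ... + a[p] x[t-p] + e[t]. The synthetic test signal is invented.

```python
import random

def autocorr(x, max_lag):
    """Biased sample autocorrelation r[0..max_lag] of a mean-removed block."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    return [sum(d[t] * d[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]

def ar_coefficients(x, p):
    """AR(p) coefficients [a1, ..., ap] from the Yule-Walker equations,
    solved with the Levinson-Durbin recursion."""
    r = autocorr(x, p)
    a = []
    err = r[0]  # prediction error power at order 0
    for k in range(1, p + 1):
        acc = r[k] - sum(a[j] * r[k - 1 - j] for j in range(len(a)))
        refl = acc / err  # reflection coefficient at order k
        a = [a[j] - refl * a[k - 2 - j] for j in range(k - 1)] + [refl]
        err *= (1.0 - refl * refl)
    return a


# Synthetic AR(2) block with known coefficients (0.5, -0.3).
random.seed(0)
x = [0.0, 0.0]
for _ in range(20000):
    x.append(0.5 * x[-1] - 0.3 * x[-2] + random.gauss(0.0, 1.0))
a = ar_coefficients(x, 2)
```

For the data here one would call `ar_coefficients(block, 4)` on each block of N samples from each accelerometer.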

Examples of AR coefficients are given in figures 1 and 
2. It is clear that there is a reasonable amount of struc- 
ture to these, but also that certain conditions cannot 
be separated reliably using only such data. In fact, 
there are 8 accelerometers from which to choose, and 
a further two AR coefficients. 

Figure 1: Scatter plot of AR coefficients $a_1$ versus $a_2$
(out of p = 4 AR coefficients), for accelerometer 3 and
combined over all torque levels, estimated on blocks of
length 4096, for faults 3, 5, 8 and no-fault conditions.

3.2.2 FFT Features

AR coefficients are able to digest much global spectral
information into a small dimension. There is some in-
dication that faults may manifest in specific frequency
behavior, and hence we additionally investigate the
use of relative harmonic power (RHP). The i-th RHP
is the ratio of the i-th highest spectral peak (measured
via FFT) to the average power. In the sequel we use
4 RHPs. Examples are given in figures 3 and 4, for
the same conditions as figures 1 and 2. It is clear that
these features are a less direct indication of fault class.
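
The RHP definition above can be sketched as below. Several details are assumptions made for the example: peaks are taken as local maxima of the periodogram, "average power" is taken as the mean of the periodogram over the positive-frequency bins, and a naive DFT stands in for the FFT so that the block needs no external library.

```python
import cmath
import math

def periodogram(x):
    """|DFT|^2 / N for bins 1..N/2-1 (naive O(N^2) DFT, fine for a demo)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 / n
            for k in range(1, n // 2)]

def relative_harmonic_power(x, count=4):
    """Ratios of the `count` highest spectral peaks to the average power."""
    p = periodogram(x)
    mean_power = sum(p) / len(p)
    # A peak is a bin that exceeds its left neighbour and is not below its right.
    peaks = [p[k] for k in range(1, len(p) - 1)
             if p[k] > p[k - 1] and p[k] >= p[k + 1]]
    peaks.sort(reverse=True)
    return [v / mean_power for v in peaks[:count]]


# Two tones: amplitude 1 at bin 8 and amplitude 0.5 at bin 20 (toy signal).
n = 64
x = [math.sin(2 * math.pi * 8 * t / n) + 0.5 * math.sin(2 * math.pi * 20 * t / n)
     for t in range(n)]
rhp = relative_harmonic_power(x)
```

Halving the amplitude quarters the power, so the first two RHP values of this toy signal differ by a factor of four.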
Table 1: Percentage of correct classification for three
classifiers, versus accelerometer number; features are
4th-order AR coefficients from individual accelerome-
ters. (Table body lost in extraction; surviving entries:
72, 84, 34, 30.)


Figure 2: Scatter plot of AR coefficients $a_1$ versus $a_2$
(out of p = 4 AR coefficients), for accelerometer 3 and
combined over all torque levels, estimated on blocks of
length 4096, for faults 4, 6, 7 and no-fault conditions.


3.3 Results for RCE, LVQ and DT

We first examine the results for the case that ac-
celerometers are used individually. The features used
are AR coefficients of order p = 4, each estimated on
a block of length N = 4096. Results are reported
in table 1. None of these performances is acceptable,
although accelerometers 3, 4, and 7 appear to be the
most promising. Motivated by this, we attempt to
classify by combining accelerometers. Example results
are shown in table 2. We find that the combination of
accelerometers 3 and 7 is the most propitious. There is
apparently little benefit from using all accelerometers.

accelerometers   RCE    LVQ    DT
2,7              96.4   89.4   94.7
3,7              99.1   96.7   95.5
4,7              98.3   89.6   94.9
(label lost)     98.5   96.6   91

Table 2: Percentage of correct classification for three
classifiers, with combined accelerometer AR coeffi-
cients (p = 4) from individual accelerometers as fea-
tures. (In the 1st row p = 2.)

In table 3 we explore the choice of AR order. The
results indicate that p = 4 is a good compromise be-
tween sensitivity and dimensionality. With this choice
we consider adding the RHP features. In table 4 we
do, and additionally compare the results for different
block lengths. The results become quite outstanding
in the cases N = 4096 and N = 16384, particularly
for the RCE classifier; the LVQ classifier is somewhat
less satisfying, and the DT scheme is clearly outperformed.

Finally, we note that we have chosen to ignore the
torque level in our classification feature set. That is,
we have trained using combined data from all torque
levels, and results to this point are given in terms of
combined probability of correctness. It could be ar-
gued that this is dangerous, in that poor performance
may lurk at some torque level; in fact, as seen in table
5, this is not the case.
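
The per-torque check described above amounts to training once on pooled data and then scoring the test decisions separately within each torque level. A minimal sketch of the scoring step follows; the function name and the toy labels are hypothetical and are not the paper's data.

```python
from collections import defaultdict

def accuracy_by_group(groups, truths, predictions):
    """Percentage of correct decisions within each group (e.g. torque level)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for g, t, p in zip(groups, truths, predictions):
        totals[g] += 1
        hits[g] += (t == p)
    return {g: 100.0 * hits[g] / totals[g] for g in totals}


# Toy test decisions, invented for illustration.
torque = [27, 27, 40, 40, 40, 100]
truth = ["f3", "ok", "ok", "f5", "f3", "ok"]
pred = ["f3", "f5", "ok", "f5", "f3", "ok"]
per_torque = accuracy_by_group(torque, truth, pred)
```

A low value in any single group would reveal exactly the hidden weakness that a pooled accuracy figure could mask.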


Figure 3: Scatter plot of RHP values, corresponding 
to figure 1. 

Figure 4: Scatter plot of RHP values, corresponding
to figure 2. 


p     RCE    LVQ    DT
2     83.2   88.7   87.7
3     94.0   85.4   97.4
4     99.1   96.7   95.5
6     99.0   92.8   95.0
8     98.9   91.1   95.9
12    97.6   96.1   93.8

Table 3: Percentage of correct classification for three
classifiers, with combined accelerometers 3 and 7, for
various AR orders p, estimated on data blocks of
length N = 1024.


N       RCE    LVQ    DT
1024    96.8   95.1   93.1
4096    99.6   97.9   93.9
16384   99.3   96.7   91.9

Table 4: Percentage of correct classification for three
classifiers, with combined accelerometers 3 and 7, for
AR order p = 4, estimated on data blocks of
length N. The feature set is augmented by the RHP
spectral peak clues.


3.4 Results for GM

We show example results for the two homoscedastic
GM classifier variants in figure 5. Apparently the GM
classifier is not as good as the RCE scheme in this sit-
uation; GM classifiers are often more useful when the
data is less separable and when confidence information
is desired, so it is perhaps interesting that the perfor-
mance is as good as it is. Of particular note is the
M = 1 GM classifier: this is essentially a quadratic
discriminant, and its probability of error is very low.
As regards the second GM classifier variant, that with
a diagonal covariance matrix, it is interesting to ob-
serve from figure 5 that the performance improves
markedly as the number of mixture elements M is in-
creased. There is some explanation of this in figure 6,
in which the “coverage” of one class’s data by the mix-
ture elements is illustrated. It is clear that the more
elements, the more complete the coverage.
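
For the diagonal-covariance variant, the decision rule of section 2.4 can be sketched as follows: each class has a mixture whose elements share one diagonal covariance, and the class with the largest likelihood wins, since classes are taken as equally likely a priori. The mixture parameters below are hypothetical. In practice they would come from EM training, which is not shown.

```python
import math

def log_mixture_pdf(x, weights, means, variances):
    """log of sum_i pi_i N(x; m_i, diag(variances)); the diagonal covariance
    is shared by all mixture elements (homoscedastic)."""
    log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in variances)
    terms = [math.log(w) + log_norm
             - 0.5 * sum((xi - mi) ** 2 / v
                         for xi, mi, v in zip(x, m, variances))
             for w, m in zip(weights, means)]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def gm_classify(x, models):
    """Pick the class with the largest likelihood (equal class priors)."""
    return max(models, key=lambda c: log_mixture_pdf(x, *models[c]))


# Hypothetical per-class models: (weights, means, shared diagonal variances).
models = {
    "no-fault": ([0.5, 0.5], [[0.0, 0.0], [1.0, 0.0]], [0.1, 0.1]),
    "fault": ([1.0], [[3.0, 3.0]], [0.2, 0.2]),
}
near_ok = gm_classify([0.9, 0.1], models)
near_fault = gm_classify([2.8, 3.1], models)
```

With axis-aligned contours, an elongated or tilted cluster needs several elements to be covered, which is consistent with the improvement seen as M grows.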

3.5 Results for BDRA 

In table 6 we show the results for the BDRA in terms 
of correct detection of a fault condition; no attempt is
made here to isolate the fault, but testing is simply bi- 
nary. (The BDRA is capable of multi-class operation, 
but the version used does not support that.) Despite 
the fact that the BDRA is not particularly well-suited 
to the problem, the results are quite good. It is par- 
ticularly notable that the algorithm is able to predict 
its own performance with reasonable fidelity. 
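
The binary test can be illustrated with a simplified sketch. This is not the BDRA of [6] itself: it shows only the posterior predictive probability of a quantized observation under a uniform Dirichlet prior, which works out to (count + 1) / (total + number of symbols), and a decision that compares the two classes. Feature selection, level merging, and the error-probability calculation are omitted, and the training counts are invented.

```python
def predictive_probability(counts, symbol):
    """P(next observation == symbol | training counts), under a uniform
    Dirichlet prior on the symbol probabilities."""
    return (counts[symbol] + 1) / (sum(counts) + len(counts))

def detect_fault(symbol, counts_ok, counts_fault):
    """Declare a fault when the fault class explains the symbol better."""
    return (predictive_probability(counts_fault, symbol)
            > predictive_probability(counts_ok, symbol))


# Hypothetical training counts over 4 quantized symbols per class.
counts_ok = [8, 2, 0, 0]
counts_fault = [1, 1, 4, 4]
flag_symbol_2 = detect_fault(2, counts_ok, counts_fault)
flag_symbol_0 = detect_fault(0, counts_ok, counts_fault)
```

Note that a symbol never seen in training still receives non-zero probability, which is what keeps the rule usable when the quantization is fine relative to the amount of training data.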

As indicated earlier, a strength of the BDRA is that it 
is able to determine for itself a feature set. In fact, it
is originally “given” the entire set of features, quan-
tized to whatever fineness is desired; in table 6 this is
5 or 10 levels per feature, thresholded for equal proba-
bility, meaning that in the case of 10 levels and p = 6 there
are initially 8 × (6 + 4) × 10 = 800 possible observa-
tions. In table 7 the final quantization from the BDRA
is shown, and the dominance of accelerometers 3 and 
7 is clear. Table 7 deals only with AR coefficients: 
if RHP features are also presented to the BDRA, it 
turns out that these are often chosen. 
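
The initial quantization, thresholded for equal probability, can be sketched with empirical quantiles of the training values. This is a minimal illustration under that assumption, with hypothetical helper names. The final line reads the count in the text as 8 accelerometers, 6 AR plus 4 RHP features each, and 10 levels per feature, which is also an interpretation.

```python
import bisect

def equal_probability_thresholds(values, levels):
    """Thresholds placing roughly equal numbers of training values in each
    of `levels` bins (empirical quantiles)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // levels] for k in range(1, levels)]

def quantize(value, thresholds):
    """Bin index in 0..len(thresholds)."""
    return bisect.bisect_right(thresholds, value)


# Toy training values for one feature, invented for illustration.
train = [float(v) for v in range(100)]
thresholds = equal_probability_thresholds(train, 5)
bins = [quantize(v, thresholds) for v in train]

# Initial size quoted in the text for 10 levels and p = 6.
initial_observations = 8 * (6 + 4) * 10
```

Starting from such a fine grid is harmless in principle, since levels and whole features are merged or discarded afterwards.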


torque   RCE    LVQ    DT
27%      97.6   94.1   92.9
40%      100    100    98.8
45%      100    100    100
50%      100    100    97.2
60%      100    98.6   88.9
70%      99.1   82.4   83.3
75%      97.9   ...    ...

(Remaining entries lost in extraction; one surviving value, 91.7, cannot be placed.)

Table 5: Percentage of correct classification for three
classifiers, with combined accelerometers 3 and 7, for
AR order p = 4, estimated on data blocks of
length N = 4096. The feature set is augmented by the
RHP spectral peak clues. Training data is combined
over all torque levels, and testing is done individually
at each torque level.


Figure 5: Probability of error performance for
Gaussian mixture classifiers. Here p = 4 and N = 1024.


4 Summary

Here we have reported on a signal processing tool-
box specially matched to the Qualtech Systems TEAMS
diagnostic inference engine, and in particular on its
classification capability as applied to the “Westland”
data set. We have found that essentially perfect diag-
nostic performance is achievable via the use of AR co-
efficient features augmented by harmonic peak infor-
mation. The best classification performance appears
to come from the RCE learning/classification scheme.
The approach works well across all torque levels, so
there is no need to supply engine load information to
the classifier. We have also found that the Bayesian
data reduction (BDRA) approach, despite not being
well-matched to the problem, works surprisingly well,
and indeed that its ability to select features (perhaps
for other classifiers?) is particularly promising.








Figure 6: Dots show scatter plot of AR coefficients $a_1$
versus $a_2$ (out of p = 4 AR coefficients), for accelerom-
eter 3 and combined over all torque levels. Ellipses
are probability contours for elements of diagonal-R
Gaussian mixture fit with 8 elements.

(Column headers lost in extraction; surviving entries, row by row:)

...     98.5
4       95.6   98.8   95.6   98.8
4096    6      92.0   95.2   94.1   96.4

Table 6: Percentage of correct fault detection for
BDRA, using AR(p) coefficients and RHP clues. Sub-
script of C denotes number of initial quantization lev-
els per feature; superscript a means actual, and t is
theoretical.